Power Minimisation of VLSI Wave Digital Filters through Systolic Block Size Selection

Author

  • P. Israsena
Abstract

Systolic architectures for Wave Digital Filters are investigated for low-power applications. Based on a 3-port adaptor implementation of the 2nd-order section, minimum power is found using pipelining with a 2-bit block size, for which the power consumption is reduced by 50% and the power-area-delay performance is improved by a factor of 5 relative to the starting, non-pipelined implementation.

Introduction

Wave Digital Filters, especially the lattice forms, which are a Parallel Combination of Allpass Subfilters (PCAS), are becoming established for various signal processing applications [1,2]. Because of their low coefficient sensitivity and regularity, the filters are ideal for VLSI implementation. In earlier work, the 2-port adaptor was employed to realise the 2nd-order section, the filter's fundamental building block. Recent work, however, suggests the superiority of a 3-port implementation, especially in terms of speed [3]. With an increasing demand for portability, low-power implementation of the filters is now relevant. This letter explores the effect of different levels of pipelining on area, speed and power.

Pipelined architectures

The 3-port adaptor implements the following three functions:

$y_1 = x_1 - \gamma_1 (x_2 - x_1 - x_3)$ (1a)
$y_2 = -x_2 - \gamma_2 (x_2 - x_1 - x_3)$ (1b)
$y_3 = x_2 - x_1 - x_3 - y_1 - y_2$ (1c)

If $x_1(t) = y_1(t-1)$ and $x_2(t) = y_2(t-1)$, then $y_3/x_3$ is a 2nd-order allpass function [2] with $-2 < \gamma_1 < 0$ and $-2 < \gamma_2 < 0$. The hardware realisation of equations (1) can have the structure shown in Figure 1a. It consists of 2 symmetrically interconnected arithmetic units, labelled A and B. Within each unit, one multiplication and 3 summations/subtractions are performed. In analysing the circuit we focus on the critical path, which involves the cross-connection between A and B. Using carry-save arithmetic, unit B can be bit-sliced with the circuit shown in Figure 1b. Unit A is similar, differing only in the position of the negations.
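As a behavioural illustration of the adaptor recursion described above, the following minimal Python sketch iterates the three adaptor functions with ports 1 and 2 fed back through unit delays. It assumes the equations read y1 = x1 - γ1(x2 - x1 - x3), y2 = -x2 - γ2(x2 - x1 - x3) and y3 = x2 - x1 - x3 - y1 - y2; the function name `allpass_section` and the coefficient values are hypothetical choices for illustration, and floating-point arithmetic stands in for the fixed-point bit-sliced hardware.

```python
def allpass_section(gamma1, gamma2, x3_samples):
    """Iterate the 3-port adaptor with x1(t) = y1(t-1) and x2(t) = y2(t-1).

    Assumed reading of the adaptor equations (floating point, no truncation):
        y1 = x1 - gamma1 * (x2 - x1 - x3)
        y2 = -x2 - gamma2 * (x2 - x1 - x3)
        y3 = x2 - x1 - x3 - y1 - y2
    """
    x1 = x2 = 0.0          # delay registers start empty
    outputs = []
    for x3 in x3_samples:
        s = x2 - x1 - x3   # term shared by all three equations
        y1 = x1 - gamma1 * s
        y2 = -x2 - gamma2 * s
        y3 = s - y1 - y2
        outputs.append(y3)
        x1, x2 = y1, y2    # unit delays on ports 1 and 2
    return outputs


# Illustrative coefficients inside -2 < gamma < 0 (values are assumptions).
impulse = [1.0] + [0.0] * 199
h = allpass_section(-0.5, -0.75, impulse)
energy = sum(v * v for v in h)
print(h[:2], energy)
```

For a stable allpass response the energy of the impulse response should come out close to 1, which gives a quick sanity check on the sign conventions assumed here.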
A and B can be realised as arrays of d+3 bit slices, where d is the data word length. Each array has latency p+2, where p is the coefficient word length. That is to say, the least significant bit of the properly truncated or rounded data input appears at bit slice p+3. The complete unit therefore requires p+d+2 bit slices. In what follows we consider 20-bit input data and 8-bit coefficients, enough to satisfy modern signal processing requirements. Figure 2 depicts the system, which with no pipelining clearly has a critical delay of p+d+2 = 30 units.

Pipeline registers can be introduced into the A and B arrays in ways that ensure a consistent data format in the recursive loop. The first level of such pipelining is achieved by placing registers at the positions marked (i) in Figure 2. This will be known hereafter as architecture B3 (3 blocks), which has a critical path of 10 units. However, as there are now 2 registers in the recursive loop, two clock cycles are required to compute one filter sample. The next pipelining level is achieved by adding registers at the positions marked (ii); this is architecture B6, which has a critical path of 5 units and 3 registers in the recursive loop. Two further pipelined architectures are also studied: B15, with registers every 2 slices and 6 delays in the loop, and B30, with registers between every slice and 11 delays in the loop. The number of loop delays is apparent from examining the path between bit slices of equal significance from unit A to unit B, or vice versa. A number of loop delays greater than one allows the architecture to be multiplexed, or interleaved, that is to say, to process that number of independent data streams. In the implementation of a filter of sufficient order this implies a reduction in the total system area by the same factor.

Results

The above architectures, and also the starting, non-pipelined case, which we call B1, are implemented with Cadence Design Framework II using 1 μm ES2 standard cell CMOS technology.
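The critical-path lengths and loop-delay counts quoted for architectures B1 to B30 can be tabulated with a short sketch. The slice count per block follows directly from the p+d+2 total given in the text. The loop-delay expression 1 + (p+2) // slices is not stated in the letter; it is an inferred pattern that happens to reproduce the quoted figures (1, 2, 3, 6 and 11) and should be treated as an assumption. The function name `architecture_parameters` is hypothetical.

```python
def architecture_parameters(blocks, d=20, p=8):
    """Return (slices per block, loop delays) for a design split into `blocks` blocks.

    d is the data word length and p the coefficient word length, as in the text.
    """
    total_slices = p + d + 2                 # 30 for d = 20, p = 8
    if total_slices % blocks != 0:
        raise ValueError("block count must divide the total number of slices")
    slices_per_block = total_slices // blocks  # critical path in slice units
    # Assumed pattern: one original loop delay plus one register per block
    # boundary crossed within the array latency of p + 2 slices.
    loop_delays = 1 + (p + 2) // slices_per_block
    return slices_per_block, loop_delays


for blocks in (1, 3, 6, 15, 30):
    slices, delays = architecture_parameters(blocks)
    print(f"B{blocks}: critical path {slices} slices, {delays} loop delays")
```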
The layouts are automatically generated, and Verilog simulations with a large number of randomly generated test vectors are performed with extracted capacitances included. The minimum clock period is found by direct search, and the average power is determined by post-processing a node activity dump and incorporating the average cell power quoted in the data library. The results are shown in Table 1. The minimum clock period decreases with the number of bit slices in the critical path of the architecture, but not strictly in proportion, owing to the differing sum and carry delays of the standard cell used and to the latch delays and setup times. The simulated power multiplied by the clock period is the energy per filter sample, because the number of clock cycles taken to complete one filter sample (column 2) is exactly compensated for by the multiplexing capability discussed above. The area is the bounding box of the synthesized layout. The effective area is the area divided by the multiplex factor, and is used in calculating the power-area-delay product shown in the last column of the table.

We have isolated the contributions to the power dissipation from different parts of the circuits. These are shown in Figure 3a. Pcell decreases with the number of bit slices in the register transfer paths. Because the number of bit-level operations required to complete the filter cycle is the same in each case, this reduction is entirely attributable to a reduction in glitch activity [4]. This effect competes with the increasing power used in the pipeline registers, Pclk, leading, at least in principle, to a minimum in the total power. Such a minimum does occur here, and is seen for architecture B15. Comparing the optimum architecture B15 with the starting architecture B1, we see that the power is reduced by approximately 50% and the power-area-delay product is improved (reduced) by a factor of approximately 5. These comparisons are all for a single fixed supply voltage. The effect of varying Vdd is shown in Figure 3b.
The minimum is less pronounced at lower supply voltages. Comparisons with supply voltage scaling [5] to normalised maximum speed are to be considered elsewhere. However, we feel that the system design context of VLSI filters is one in which a single, or at best a limited number of, supply voltages would be available.

Conclusions

This letter presents a low-power, high-speed pipelining solution for implementing a 2nd-order allpass section using a 3-port adaptor. We have shown that the lowest-power solution is achieved by selecting 2-bit level pipelining, which is also shown to give optimal power-area-delay performance.

References

[1] P.A. REGALIA, S.K. MITRA and P.P. VAIDYANATHAN: "The Digital All-Pass Filter: A Versatile Signal Processing Block", IEEE Proceedings, Vol. 76, No. 1, pp. 19-37, January 1988
[2] S.S. LAWSON and A.R. MIRZAI: "Wave Digital Filters", Ellis Horwood, 1990
[3] M. ANDERSON, S. SUMMERFIELD and S. LAWSON: "Realisation of lattice wave digital filters using three-port adaptors", Electronics Letters, Vol. 31, No. 8, pp. 628-629, April 1995
[4] M. FAVALLI and L. BENINI: "Analysis of Glitch power dissipation in CMOS ICs", Proc. International Workshops on Low Power Design, pp. 123-128, 1995
[5] A.P. CHANDRAKASAN, S. SHENG and R.W. BRODERSEN: "Low-Power CMOS Digital Design", IEEE J. of Solid State Circuits, Vol. 27, No. 4, pp. 473-483, April 1992

Figure captions

Figure 1. 2nd-order allpass section using the 3-port adaptor: a) signal flow graph in modular form; b) bit-slice of unit A for coefficient word length p=3. The large circles represent full adders, the solid squares AND gates. C0-C4 are the carry outputs. The dependencies refer to bit indices.
Figure 2. The adaptor critical path in bit-sliced form.
Figure 3. Power simulation results: a) contributions from the logic (Pcell), the registers (Pff), the clock drive (Pclk) and the routing (Pnet); b) total power for different supply voltages.

Table captions

Table 1: Simulation results for standard cell designs of the various architectures studied.

Architecture  Loop delays  Min. clock period (ns)  Energy/sample (pJ)  Area (mm²)  Effective area Aeff (mm²)  Max. sample rate (MHz)  P.Aeff.D (relative units)
B1            1            39.0                    7023                1.38        1.38                      25.6                    100
B3            2            24.0                    4681                1.64        0.82                      20.8                    48
B6            3            14.5                    3679                1.90        0.63                      23.0                    26
B15           6            8.30                    3485                2.76        0.46                      20.0                    21
B30           11           5.80                    4476                4.04        0.37                      15.6                    28
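The derived columns of Table 1 can be cross-checked with a few lines of Python. The effective area (area divided by the number of loop delays) follows the definition in the text, and the maximum sample rate is taken as the reciprocal of loop delays times clock period. The letter does not spell out how the last column is formed; the product energy per sample × effective area × sample period, normalised so that B1 = 100, is an assumption that reproduces the tabulated values to within about one unit. The names `ROWS`, `derived` and `relative_pad` are hypothetical.

```python
# Rows copied from Table 1:
# (architecture, loop delays, clock period in ns, energy/sample in pJ, area)
ROWS = [
    ("B1", 1, 39.0, 7023, 1.38),
    ("B3", 2, 24.0, 4681, 1.64),
    ("B6", 3, 14.5, 3679, 1.90),
    ("B15", 6, 8.30, 3485, 2.76),
    ("B30", 11, 5.80, 4476, 4.04),
]


def derived(row):
    """Return (effective area, sample period in ns, sample rate in MHz)."""
    _, loop_delays, t_clk, _, area = row
    a_eff = area / loop_delays        # area shared by interleaved streams
    t_sample = loop_delays * t_clk    # clock cycles per sample times period
    return a_eff, t_sample, 1e3 / t_sample


def relative_pad(rows):
    """Assumed figure of merit: energy * Aeff * sample period, B1 = 100."""
    raw = {}
    for row in rows:
        a_eff, t_sample, _ = derived(row)
        raw[row[0]] = row[3] * a_eff * t_sample
    return {name: 100.0 * value / raw["B1"] for name, value in raw.items()}


pad = relative_pad(ROWS)
for row in ROWS:
    a_eff, _, rate = derived(row)
    print(f"{row[0]}: Aeff={a_eff:.2f}, rate={rate:.1f} MHz, PAD={pad[row[0]]:.1f}")
```

Under this reading, B15 has the smallest product, which agrees with the optimum identified in the text.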

Related resources

A new VLSI architecture without global broadcast for 2-D digital filters

In this paper, we propose the new two-dimensional (2-D) systolic-array structures of IIR/FIR digital filters without global broadcast by the different derivation and another systolic transformation. For more practical considerations, we further provide a detailed block diagram of a 2-D FIR filter using recently proposed multiplier to reduce the roundoff quantization error in the logic-gate leve...

A High Throughput Systolic Implementation of the Second Order Recursive Filter

There have been many 2D (two–dimensional) VLSI structures introduced in the literature for 1D (one–dimensional) recursive digital filters with high throughput. The technique applied for the implementations is mainly based on block–state–variable filter descriptions. This paper introduces a high throughput systolic implementation of direct form second order recursive filters. The systolic struct...

Design and Implementation of a High Speed Systolic Serial Multiplier and Squarer for Long Unsigned Integer Using VHDL

A systolic serial multiplier for unsigned numbers is presented which operates without zero words inserted between successive data words, outputs the full product and has only one clock cycle latency. The multiplier is based on a modified serial/parallel scheme with two adjacent multiplier cells. Systolic concept is a well-known means of intensive computational task through replication of func...

VLSI Design of a High Performance Decimation Filter Used for Digital Filtering

With the rapid development of computers and communications, more and more chips are required to have small size, low-power and high performance. Digital filter is one of the basic building blocks used for implementation in Very Large Scale Integration (VLSI) of mixed-signal circuits. This paper presents a design of decimation filter used for digital filtering. It consists of Cascode Integrated ...

Publication date: 2007